arXiv:1501.04849v4 [stat.AP] 15 Sep 2015 


Bayesian Gaussian Copula Graphical Modeling 
for Dupuytren Disease 


Abdolreza Mohammadi 
University of Groningen 
a.mohammadiSrug.nl 


Fentaw Abegaz 
University of Groningen 
f.abegaz.yazewOrug.nl 


Edwin van den Heuvel Ernst G. Wit 

Eindhoven University of Technology University of Groningen 
e.r.V.d.heuvelOtue.nl e.c.witOrug.nl 


September 16, 2015 


Abstract 

Dupuytren disease is a fibroproliferative disorder with unknown eti¬ 
ology that often progresses and eventually can cause permanent contrac¬ 
tures of the affected fingers. In this paper, we provide a computationally 
efficient Bayesian framework to discover potential risk factors and inves¬ 
tigate which fingers are jointly affected. Our Bayesian approach is based 
on Gaussian copula graphical models, which are one potential way to dis¬ 
cover the underlying conditional independence structure of variables in 
multivariate mixed data. In particular, we combine the semiparametric 
Gaussian copula with extended rank likelihood which is appropriate to 
analyse multivariate mixed data with arbitrary marginal distributions. 
For the graph structure learning, we construct a computationally efficient 
search algorithm which is a trans-dimensional MGMC algorithm based on 
a birth-death process. In addition, to make our statistical method easily 
accessible to other researchers, we have implemented our method in C++ 
and interfaced with R software as an R package BDgraph which is available 
at http://CRAN.R-project.org/package=BDgraph 

Keywords: Dupu3d;ren disease; Risk factors; Bayesian inference; Gaus¬ 
sian copula graphical models; Bayesian model selection; Latent variable 
models; Birth-death process; Markov chain Monte Carlo. 


1 Introduction 

Dupuytren disease is a hereditary disorder that is present worldwide. It is how¬ 


ever more prevalent in people with northern European ancestry (Bayat and 


McGrouther, 20061. The disease is an incurable fibroproliferative disorder that 


alters the palmar fascia of the hand and may causes progressive and perma¬ 
nent flexion contracture of the fingers. Initially, skin pittings and subcutaneous 
nodules appear in the palm; see Figure |l(a)| At a later stage, cords appear 
that connect the nodules and may contract the fingers into a flexed position; 
see Figure |l(b)[ Contracture can arise in a single ray or in multiple rays. The 
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disease mostly appears on the ulnar side of the hand, i.e., it affects the pink 
and ring fingers most-frequently (see Figure]^. The only available treatment 
is surgical intervention. Although much is known about the disease, the ques¬ 
tions arising are: (1) What variables affect the disease and in what way? (2) 
Should surgical intervention focus on single or on multiple fingers? The first is 
an epidemiological question, the second is a clinical one. 



Figure 1: (a) hand image of a patient with Dupuytren disease whose fingers have 
not been affected by the disease. Palmar nodules and small cords have no signs of 
contracture, (b) hand image of a patient with Dupuytren disease whose finger have 
been affected by the disease. 


Empirical research has described the patterns of occurrence of Dupuytren 
disease in multiple fingers. Meyerding (1936] | have stated that most often the 
combination of affected ring and little hngers occurred, followed by the combi¬ 
nation of an affected third, fourth, and hfth hnger. Tubiana et al. (1982[ ) found 
that Dupuytren disease rarely affects radial side alone, and that the radial affect 
is often associate with an affected ulnar side. Milner (20031 noticed that patients 
who had required surgery because of an affected thumb were on average 8 years 
older and had suffered significantly longer from the disease compared to patients 
with a mildly affected radial side. Moreover, these patients suffered from ulnar 
disease that repeatedly had required surgery, suggesting an intractable form of 
disease. More recently, banting et al. (20141, with a multivariate ordinal logit 
model, suggested that the middle hnger is substantially correlated with other 
hngers on the ulnar side, and the thumb and index hnger are correlated. They 
took into account age and sex, and tested hypotheses on independence between 
groups of hngers. However, so far, no serious multivariate analysis of the disease 
has been performed taking into account potential risk factors. 

Essential risk factors of Dupuytren disease include both phenotypic and 
genotypic factors (Shih and Bayat, 20101, such as genetic predisposition and 
ethnicity, as well as sex and age. However, it is unclear whether Dupuytren dis¬ 
ease is a complex oligogenic or a simple monogenic Mendelian disorder. Several 
life-style risk factors (some considered controversial) include smoking, excessive 
alcohol consumption, manual work, and hand trauma. Also several diseases, 
such as diabetes mellitus and epilepsy, are thought to affect the life-style factors 
risk of incurring Dupuytren disease. However, the role of these life-style factors 
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and diseases has not been fully elucidated, and the results of different studies 


are occasionally conflicting (banting et ah, 2014). 


In this paper we analyze data collected by the Department of Plastic Surgery 
of the University Medical Center Groningen in the north of Netherlands from 
patients who have Dupuytren disease. The data are originally described by 
banting et al. (2013) and banting et al. (2014| |. Both hands of patients are ex¬ 
amined for signs of Dupuytren disease. These are tethering of the skin, nodules, 
cords, and finger contractures in patients with cords. Severity of the disease 
is measured by the angles on each of the 10 fingers. Recorded potential risk 
factors include smoking habits, alcohol consumption, whether participants had 
performed manual labor during a significant part of their life, and whether they 
had sustained hand injury in the past, including surgery. In addition, infor¬ 
mation about the presence of bedderhose diabetes, epilepsy, peyronie, knuckle 
pad, liver disease, and familial occurrence of Dupuytren disease, defined as a 
first-degree relative with Dupuytren disease, was collected. 

The primary aim of this paper is to model the relationships between the 
risk factors and disease indicators for Dupuytren disease based on the mixed 
dataset. We propose a computationally efficient Bayesian statistical method 
based on Gaussian copula graphical models (GGGMs) for discovering the joint 
conditional independence structure of binary, ordinal or continuous variables 
simultaneously. Our Bayesian framework is based on that proposed by |Dobra| 
and benkoski (2011). In GGGMs, the graph selection procedure is embedded 


inside a semiparametric framework, using the extended rank likelihood (Hoff, 


2007). In this paper we design our proposed Bayesian framework for GGGMs 


based on a computationally efficient search algorithm, using a trans-dimensional 


MGMG approach based on a continuous-time birth-death process (Mohammadi 


and Wit, 2015b). 


In this work, we follow a similar approach as that of [Mohammadi and Wit| 


(2015b) by using their birth-death MGMG algorithm. Ghanges have been made 


to allow more general data structures of mixed type using copula graphical mod¬ 
els. Importantly we have made substantial improvements in their algorithm to 
overcome the computational bottle-neck in evaluating normalizing constants. 
Furthermore, our proposed approach can handle missing data without any addi¬ 
tional computational effort, if the missingness is completely at random (MGAR). 

In Section we illustrate our Bayesian framework based on Gaussian cop¬ 
ula graphical models. In addition, we show the performance of our method 
and we compare it to state-of-the-art alternatives. In section we analyze 
the Dupuytren disease dataset based on the proposed Bayesian birth-death 
MGMG method. In this section, first analyze potential phenotype risk factors 
for Dupuytren disease. Moreover, we analyze the relationship between consider 
the severity of Dupuytren disease of pairs of fingers. The result helps surgeons 
to decide whether they should operate one finger or they should operate multi¬ 
ple fingers simultaneously. Finally, we discuss the connections between existing 
methods and possible future directions. 


2 Methodology 


Graphical models ( jbauritzen, 1996 ) provide an effective way to describe statisti¬ 
cal patterns in multivariate data. In this context undirected Gaussian graphical 
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models are commonly used, since inference in such models is often tractable. In 
undirected Gaussian graphical models, the graph structure is characterized by 
its precision matrix (the inverse of covariance matrix): the non-zero entries in 
the precision matrix show the edges in the conditional independence graph. In 
the real world, data are often non-Gaussian, like the dataset considered in our 
application. For non-Gaussian continuous data, variables can be transformed 
to Gaussian latent variables. For discrete data, however, there is no one-to- 
one transformation into latent Gaussian variables. A common approach is to 
apply a Markov chain Monte Carlo method (MCMC) to simulate both the la¬ 


tent Gaussian variables and the posterior distributions (Hoff, 2007). Another 


Bayesian approach is the Gaussian copula graphical models developed by |Dobr^ 
and Lenkoski (20111, in which the sampler algorithm is based on reversible-jump 
MCMC and a Cholesky decomposition of the precision matrix. Alternatively, 


our proposed method implements the birth-death MCMC approach (Moham- 


madi and Wit, 2015b I which has several computational advantages compared 


to reversible-jump MCMC approach as we show in our simulation examples. 

For studying the multivariate dependence structure of finger contractures 
and risk factors, we are interested in exploring the conditional independence 
graph space and identifying conditional independence graphs that are most ap¬ 
propriate for our given data. In this regard, we calculate the posterior distribu¬ 
tion of the conditional independence graph G conditional on data 


Pr(G|data) = 


Pr(G)Pr(data|G) 


Eoee Pr{G)Pr{data\G) ’ 


in which Q is the conditional independence graph space. Computing this pos¬ 
terior distribution is computationally unfeasible, since in the denominator we 
require the sum over all possible graphs. The graph space increases super- 
exponentially with the dimension of the variables. For p nodes in a conditional 
independence graph, there are p{p — l)/2 possible edges, and hence we have 
2 p(p-i )/2 different possible graphs corresponding to all combinations of individ¬ 
ual edges being in or out of the graph. For example, in our data we have 23 
variables (p = 23), resulting in a total number of possible conditional indepen¬ 
dence graphs of more than 10^®. This motivates us to develop effective search 
algorithms for exploring graphical model uncertainty. In order to be accurate 
and scalable, the key is to design computationally efficient search algorithms 
that are able to quickly move towards high posterior probability regions, and to 
take advantage of local computations. 


2.1 Gaussian copula graphical models 


In graphical models, conditional dependence relationships among random vari¬ 
ables are presented as a graph G. A graph G = {V, E) specifies a set of vertices 
V = {1, 2, ...,p}, whe re each vertex co rresponds to a random variable, and a set 
of edges E CV xV (Lauritzen, 1996). E denotes the set of non-existing edges. 
We focus here on undirected graphical models, where {i,j) & E ^ {j,i) € E, 
also known as Markov random fields. The absence of an edge between two 
vertices specifies the pairwise conditional independence of these two variables 
given the remaining variables, while an edge between two variables determines 
the conditional dependence of the variables. In our application, for example. 
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disease risk factors (such as disease factors, alcohol, and hand injury) will be 
the nodes, and dependencies among them will be the edges. 

A graphical model that follows a multivariate normal distribution is called a 


Gaussian graphical models, also known as a covariance selection model (Demp¬ 


ster, 19721. Zero entries in the precision matrix correspond to the absence of 


edges on the graph and conditional independence between pairs of random vari¬ 
ables given all other variables. We define a zero mean Gaussian graphical model 
with respect to the graph G as 


A^g = {A4.(0,E) |K = E-iePG}, 

where Pg denotes the space of p x p positive definite matrices with entries 
{i,j) equal to zero whenever (i,j) G E. Let z = (z^,...,z") be an independent 
and identically distributed sample of size n from model Aic-, where z* is a p 
dimensional vector of variables. Then, the likelihood is 

P(z|iL, G) cx \K\^/^ exp |-itr(iL5)| , (1) 


where S = z'z. 

In line with extending the idea of Gaussian graphical modeling for non- 
Gaussian data, the Gaussian copula has been considered for graphical modeling 
among a set of mixed variables (Dobra and Lenkoski, 2011). A copula is a 


multivariate cumulative distribution function whose uniform marginals are on 
the interval [0,1]. Copulas provide a flexible tool for understanding dependence 
among random variables, in particular for non-Gaussian multivariate data, for 
example, the type of data application we consider in this study which consists 
of 23 mixed variables (13 phenotype risk factors and 10 variables on the severity 
of Dupuytren disease in all the hand fingers) of the type discrete, binary, and 
ordered categorical variables; see Section |3.1[ 


By Sklar’s theorem (Sklar, 1959) there exists a copula G such that any p 


dimensional distribution function H can be completely specified by its marginal 
distributions Fj, j = 1,... ,p, and a copula G satisfying 


H{yi,...,yp) = G {Fi{yi),Fp{yp)). 


Conversely, a copula function can be extracted from any p dimension distribu¬ 
tion function F[ and marginal distributions Fj by 

C(ui,..., Up) = 77 (Fri(yi),..., F-\yp)) , 

where F~^{s) = inf{t | Fj{t) > s} are the pseudo-inverse of Fj. The Gaussian 
copula is given by 

G{ui,...,Up I T) = ($"i(ui),.. .,$”^(Mp) I 0) , 

where Uj = Fj{yj), j = l,...,p, $p(-) is the cumulative distribution of a 
multivariate normal distribution and $(•) is a cumulative distribution function 
of a univariate normal distribution. 

The decomposition of a joint distribution into marginal distributions and 
a copula suggests that the copula captures the essential dependence features 
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between random variables. Moreover, the copula measure of dependence is in¬ 
variant to any monotone transformation of the random variables. Thus, copulas 
allow one to model the marginal distributions and the dependence structure of 
a multivariate random variables separately. In copula modeling, [Genest et al.| 
|(1995[ ) develop a popular semiparametric estimation approach or rank likeli¬ 
hood based estimation in which the association among variables is represented 
with a parametric copula but the marginals are treated as nuisance parameters. 
The marginals are estimated non-parametrically using the scaled empirical dis¬ 
tribution function Fj{y) = where Fn^{y) = < v}- 

As a result estimation and inference are robust to misspecihcation of marginal 
distributions. 

However, for discrete data, where the distribution of ranks depends on the 
univariate marginal distributions, the use of the semiparametric estimator is 
somewhat inappropriate for the analysis of mixed continuous and discrete data 


(Hoff, 20071. To overcome this, he proposes the extended rank likelihood, which 


is a type of marginal likelihood approach where the ranks are free of the marginal 
distributions of the discrete data. This makes the extended rank likelihood ap¬ 
proach more appropriate for graphical modeling that avoids the difficult problem 
of modeling the marginal distributions. 

Let T be a collection of continuous, binary, ordinal or count variables with 
Fj the marginal distribution of Yj and F~^ its pseudo inverse. For constructing 
a joint distribution of Y, we introduce a multivariate normal latent variable as 
follows 

Zi,...,Z„lV(0,S), 


where S is the correlation matrix. We define the observed data as 


y,, = F-\^z,,)). 

A Gaussian copula-based joint cumulative distribution of Y is given by 

P{Yi<yi ,... ,yp<yp) = $p(d>-i(Fi(2/i)),..., | T). (2) 

Our aim is to infer the underlying graph structure G of the mixed variables 
Y implied by the continuous latent variables Z. The observed mixed data from 
a sample of n observations y are related to the latent samples z that belong to 
the set 

A(y) = {z e : max : yf < j/f } < < ^^in {zf : yf ^ } , 

r,fc,s= l,...,n;j = l,...,p}. (3) 

It follows that inference on the latent space can be performed by substituting 
the observed data y with the event z S A(y). For a given graph G and precision 
matrix K = the extended rank likelihood is defined as 


P{^&A{y)\K,G) = P{ 2 .&A{y)\K,G)= [ Piz\K,G)dz, (4) 

JAiy) 

where the expression inside the integral for the Gaussian copula based proba¬ 
bility function given by takes a similar form as in In the next sections 
we develop a Bayesian approach based on the extended rank likelihood given in 
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2.2 Bayesian Gaussian copula graphical models 

In the Bayesian framework that follows, we infer about graph and precision 
matrix {K, G) by considering a posterior distribution 


P{K,G\z e .4(y)) cx P(z e Aiy)\K,G)PiK \ G)P{G), 


(5) 


where P{z G A{y)\K, G) is the extended rank likelihood defined in @,P{K\G) 
denotes a prior distribution of a precision matrix K for a given graph structure 
and -P(G') denotes a prior distribution for a graph G. 

In particular, we develop a simple and efficient continuous birth-death Markov 
chain Monte Carlo (MCMC) algorithm for the posterior computation that con¬ 


verges much faster than the reversible MCMC algorithm in Dobra and Lenkoski 


(2011) and overcomes the computational bottle-neck in Mohammadi and Wit 


(2015b). Moreover, we evaluate the results induced by the latent variables using 
posterior predictive analysis on the scale of the original mixed variables. 


2.2.1 Prior specification 

In what follows we briefly describe the specification of prior distributions for 
the graph G and the precision matrix K. For the prior distribution of the 
graph, we propose to use a discrete uniform distribution over the graph space, 
as a non-informative prior. We also note that other choices of priors for the 
graph structure have been considered by modeling the joint state of the edges 


(Scutari, 2013), encouraging sparse graphs (Jones et ah, 2005) or a truncated 


Poisson distribution on the graph size (Mohammadi and Wit, 2015b). 


We consider the G-Wishart (Roverato, 2002) distribution as prior distribu¬ 
tion of the precision matrix. The G-Wishart is the Wishart distribution re¬ 
stricted to the space of precision matrices with zero entries specihed by a graph 
G, Pq. The G-Wishart density for K gFq ^ Wcib,D) can be written as 

P{K\G) = ^-^|iy|(''-2)/2exp |-itr(DiF)| , 

where 5 > 2 is the degree of freedom, D is a symmetric positive dehnite matrix, 
and Icib-, D) is the normalizing constant, 

lG{b,D)= J |Rr|('’-2)/2exp|-itr(DR:)|dR:. 

Dealing with this normalizing constant has been a major issue in recent litera¬ 
ture; see Section [2. 2. 3[ 

Since the G-Wishart prior is conjugate to normally distributed data ([^, the 
posterior distribution of K condition on graph G is 

P{K\Z G A{y),G) = exp |-^tr(D*iG)|, 

where b* = h+n and D* = D + S with S = z'z, that is a G-Wishart distribution, 


Wcib*, D*). For other choices of priors for the precision matrix see Wang and 
Pillai (2013D , |Wang et al. (2015D , [W^g (2012D , [Wong et al. (2003[ ). 
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2.2.2 Sampling algorithm for posterior inference 

Sampling from the joint posterior distribution ([^ can be done by a computa¬ 
tionally efficient birth-death MCMC algorithm proposed in [Mohammadi and] 
Wit (2015b I for Gaussian graphical models. Here we extend their algorithm for 
the more general case of Gaussian copula graphical models. Our algorithm is 
based on a continuous time birth-death Markov process in which the algorithm 
explores the graph space by adding or removing an edge in a birth or death 
event, respectively. The birth and death rates of edges are determined by the 
stationary distribution of the process. The algorithm is designed in such a way 
that the stationary distribution equals to the target joint posterior distribution 
of the graph and the precision matrix ([^. The time between two successive 
events has an exponential distribution. Therefore, the probability of birth and 
death events are proportional to their rates. 

Mohammadi and Wit (2015b[ section 3) proved that the birth-death MCMC 


(BDMCMC) algorithm converges to the target joint posterior distribution of the 
graph and the precision matrix, by considering the following birth and death 
rates, 


Pe{K) 


P{G,K\k„\Z &A{y)) 


for each e G E, 


( 6 ) 


P(G-^K--\%|Z€M(y)) 

P{G,K\{h^,kjj)\ZGA{y))^ 


for each e G E, (7) 


in which = {V, E U {e}) for the inclusion of an edge from the graph G, and 
e Pg+e represents an updated precision matrix when an edge is included, 
and similarly G“® = {V,E \ {e}), and K~^ G Po-e represent exclusion of 
an edge and its updated precision matrix. We refer our proposed approach 
as the extended birth-death MCMC (EBDMCMC) algorithm. Details of the 
EBDMCMC algorithm are given in Algorithm 1. 

In Algorithm 1, the first step is to sample the latent variables given the 
observed data. Then, based on this sample, we calculate the birth and death 
rates and waiting times. The birth and death rates are used to calculate the 
jump type. Details of how to efficiently calculate the birth and death rates are 
discussed in Section [2.2.3| Finally in step 3, according to the jump, we sample 
a new precision matrix using a direct sampling scheme from the G-Wishart 
distribution which is described in Algorithm 2 in Section [2? 

For our algorithm, the Rao-Blackwellized sample mean 
subsection 2.5) provides an effective way to estimate the posterior probability of 
each graph. The Rao-Blackwellized estimate of the posterior graph probability 
is the proportion to the total waiting times for that graph (see Figure]^ in the 
right). The waiting times for each graph act as the weights of that graph (e.g. 
{IFi, IT 2 , IF 3 ,...} in Figurein the left). 


2.4[ _ 

(Cappe et ah, 2003 


2.2.3 Computing the birth and death rates 

Calculating the birth and death rates § and Q has been a major bottle-neck of 
the EBDMCMC algorithm. Here, we explain how to resolve the computational 
bottle-neck and come-up with an efficient way to calculate the death rates; the 
birth rates are calculated in a similar manner. 


















Algorithm 1 Given a graph G = (V, E) with a precision matrix K, iterate the 
following steps: 

1. Sample the latent data. For each r gV and j G {1, 2,n}, we update the 

latent value from its full conditional distribution 

Zr\K, = 4 .V\{r} , ( 8 ) 

truncated to the interval in (|^. This sampling step can be easily modified 
to handle data that are missing-at-random. That is if yr is missing, then 
the full conditional Z^lK, is the untruncated multivariate normal 

distribution given in Q. 

2 . Sample the graph based on birth and death process. 

2.1. Calculate the birth rates by equationand (3{K) = l^e{K), 

2.2. Calculate the death rates by equation 0 and d(A) = Ee6£;^e(A), 

2.3. Calculate the waiting time by W{K) = l/(/3(A) + S{K)), 

2.4. Calculate the jump type (birth or death), 

3. Sample the new precision matrix, according to the jump type, based on 

Algorithm 2. 


Following Mohammadi and Wit (2015b) and after some simplification, for 

( 9 ) 


each e = (i, j) € E, we have 


D* 


P{G) Ia-.{b,D) 27r(fcij-fc(i) 




where 

H{K,D*)=exp< 
in which 




- K^)) - {D*, - - k\y 


= 


.9 Kj,v\iiKv\j,v\j) Kv\j,j\' 


and = ATg y\e(iFy\g y\g) ^Aiy^g g. The computational bottle-neck in is 
the ratio of normalizing constants, ■ 


Dealing with calculation of normalizing constants. Calculating the ra- 
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Figure 2: This image visualizes Algorithm 1. (On top) Mixed observed data transfor¬ 
mation using the copula to sample the latent variables, (Bottom left) Continuous time 
EBDMCMC algorithm where {Wi, W 2 ,...} denote waiting times and {ti,t 2 ,...} denote 
jumping times. (Bottom right) Estimated posterior probability of the graphs which are 
proportional to sum of their waiting times. 


Cheng et al. (2012 ), [Mohaniniadi and Wit (201hb ) developed an alternative ap¬ 
proach, which borrows ideas from the exchange algorithm (Murray et ah, 20121 


and the double Metropolis-Hastings (MH) algorithm (Liang, 20101 to compute 
the ratio of such normalizing constants. However, it has been observed that 
in case of high dimensional setting (large p), the double MH sampler become 
computationally inefficient due to the curse of dimensionality. To remedy this, 
more recently Uhler et al. (2014| theorem 3.7) derived an explicit representation 
of the normalizing constant ratio. 

Theorem 2.1 ( [Uhler et al. 2014 ). let G = {V,E) he an undirected graph and 
G~^ = (U,if“®) denotes the graph G with one less edge e. Then 






■d+l)/2) 


n{b+d)/2) 


where d denotes the number of triangles formed by the edge e and two other 
edges in G and Ip denotes an identity matrix with p dimension. 

Therefore, for the case oi D = Ip, we have a simplified expression for the 
death rates in which is given by 


5e{K) = 


2D 


P{G-^)T{{b + d+l)/2) 
P(G) m + d)/2) 


33 




2.2.4 Sampling from posterior distribution of precision matrix 

Several sampling methods from a G-Wishart have been proposed for sampling 


the precision matrix; for a review of existing methods see Wang and Li (20121, 
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). Here we use an exact sampler algo¬ 
determining the graph structure and 
its precision matrix using Algorithm 1 and summarized the details in Algorithm 


Mitsakakis et al. (20111, Wang et al. (2010 


rithm developed by Lenkoski (20131 after 


2 . 


Algorithm 2. Direct sampler from precision matrix (Lenkoski, 20131. 
Given a graph G = (V, E) with precision matrix K determined by the jump 
type (birth for inclusion of an edge and death for exclusion of an edge) from 
Algorithm 1: 


1. Set E = AT- 


2. Repeat for j = 1, until convergence: 


2.1 Let Nj C H be the set of variables that connected to j in G. 
Form SjVj and ^ and solve 



K 




2.2 Form jSj G ^ by plugging zeroes in those locations not connected 
to j in G and padding the elements of /3* to the rest locations, 

2.3 Replace ^j,-j and E_jj with 

3. Return K = E“^. 


2.2.5 Simulation study 


We perform a comprehensive simulation study with respect to different graph 
structures to evaluate the performance of the proposed EBDMCMC method 
and compare it to an alternative approach proposed by Dobra and Lenkoski 
(Dobra and Lenkoski, 20111, referred to as DL. We generate mixed data from a 
latent Gaussian copula model with 5 different types of variables, that includes 
“Gaussian”, “non-Gaussian”, “ordinal”, “count”, and “binary”. We performed 
all computations with our R package BDgraph (Mohammadi and Wit, 2015a). 

Gorresponding to different sparsity patterns, we consider 3 different kinds of 
synthetic graphical models, having p nodes: 


1. Random Graph: A graph in which the edge set E is randomly generated 
from independent Bernoulli distributions with probability 2/(p — 1) and 
corresponding precision matrix is generated from K ~ VFg( 3, Jp). 

2. Cluster Graph: A graph in which the number of clusters is max {2, [p/20]}. 
Each cluster has the same structure as a random graph. The corresponding 
precision matrix is generated from K ~ Wai^, Ip)- 


3. Scale-free Graph: A scale-free graph has a power-low degree distribution 
generated by the Barabasi-Albert algorithm (Albert and Barabasi, 20021. 
The corresponding precision matrix is generated from K ~ H/3(3, Ip)- 


Figure [^presents the patterns for the 3 graph structures with p = 40 nodes. For 
each graphical model, we consider different scenarios based on different number 
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of variables p = {10,20,30,40} and different sample size; see Table For each 
scenario, we generate mixed data and fit our proposed EBDMCMC and DL 
approaches using a uniform prior for the graph, G, and the G-Wishart prior 
Wc{3,Ip) for the precision matrix, K. We run the two algorithms with the 
same starting points using 100, 000 iterations and 50, 000 iterations as a burn- 
in. Computations for these scenarios were performed in parallel on a 235 batch 
nodes with 12 cores and 24 GB of memory, running Linux. 



Figure 3: An illustration of the 3 simulated graph structures with p 
simulation example 


2.2.5 


40 nodes for our 


To assess the performance of the graph structure, we compute the Ei-score 


measure (Powers, 2011) which is defined as 


El-score = 


2TP 


2TP -H FP -h FN ’ 


( 10 ) 


where TP, FP, and FN are the number of true positives, false positives, and 
false negatives, respectively. The Ei-score lies between 0 and 1, where 1 stands 
for perfect identification and 0 for bad identification. Also, we use the mean 
square error (MSE), defined as 


MSE = ^ (Pe - I{e G Gtrue)f, (H) 

e 

where is the posterior edge inclusion probabilities and /(e G Gtme) is an 
indicator function, such that J(e G Gtrue) = 1 if e G Gtme and zero oth¬ 
erwise. We calculate the posterior edge inclusion probabilities based on the 
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Rao-Blackwellization (Cappe et al., 2003, subsection 2.5) for each possible edge 
e = {i,j) as 


Pe = 




EliWiKM) 


( 12 ) 


where N is the number of iterations and ) is the waiting time for the 

graph with the precision matrix 


Fl-score MSE 


p 

n 

graph 

EBDMCMC 

DL 

EBDMCMC 

DL 



Random 

0.52 (0.16) 

0.38 (0.15) 

6.44 (2.35) 

6.33 (1.8) 

10 

30 

Cluster 

0.58 (0.14) 

0.43 (0.14) 

5.12 (1.75) 

5.38 (1.3) 

Scale-free 

0.53 (0.18) 

0.43 (0.13) 

6.46 (2.03) 

6.43 (1.31) 



random 

0.71 (0.15) 

0.67 (0.14) 

3.96 (1.73) 

4.1 (1.42) 

10 

100 

Cluster 

0.68 (0.16) 

0.67 (0.16) 

3.84 (1.54) 

3.49 (1.14) 

Scale-free 

0.67 (0.14) 

0.63 (0.14) 

4.5 (2) 

4.16 (1.18) 



random 

0.55 (0.08) 

0.45 (0.06) 

23.28 (5.28) 

24.04 (4.00) 

20 

70 

Cluster 

0.55 (0.09) 

0.47 (0.06) 

23.67 (5.36) 

21.84 (3.22) 

Scale-free 

0.49 (0.13) 

0.39 (0.08) 

19.97 (6.29) 

14.22 (1.96) 



random 

0.74 (0.07) 

0.61 (0.07) 

15.10 (3.63) 

17.27 (4.03) 

20 

200 

Cluster 

0.73 (0.07) 

0.62 (0.07) 

14.07 (4.00) 

14.58 (3.87) 

Scale-free 

0.67 (0.10) 

0.55 (0.08) 

11.3 (3.68) 

9.38 (1.84) 



Random 

0.54 (0.06) 

0.44 (0.04) 

52.3 (9.9) 

59.1 (8.7) 

30 

100 

Cluster 

0.56 (0.05) 

0.47 (0.04) 

48.0 (6.5) 

54.4 (8.1) 

Scale-free 

0.53 (0.17) 

0.30 (0.05) 

27.7 (14.6) 

25.8 (1.7) 



Random 

0.79 (0.04) 

0.63 (0.07) 

25.8 (6.5) 

41.1 (14.3) 

30 

500 

Cluster 

0.79 (0.05) 

0.66 (0.05) 

26.3 (5.2) 

35.1 (7.9) 

Scale-free 

0.81 (0.07) 

0.59 (0.06) 

9.4 (3.2) 

11.7 (3.0) 



Random 

0.71 (0.03) 

0.57 (0.04) 

61.84 (7.77) 

81.77 (14.25) 

40 

400 

Cluster 

0.71 (0.04) 

0.59 (0.04) 

58.25 (6.63) 

69.4 (13.31) 

Scale-free 

0.67 (0.09) 

0.46 (0.06) 

23.26 (8.37) 

23.29 (6.13) 



Random 

0.77 (0.03) 

0.62 (0.06) 

50.07 (8.39) 

76.22 (20.43) 

40 

800 

Cluster 

0.80 (0.03) 

0.68 (0.03) 

43.98 (5.93) 

58.95 (7.27) 

Scale-free 

0.78 (0.07) 

0.55 (0.05) 

15.05 (5.14) 

17.22 (5.4) 


Table 1: Summary of performance measures in simulation example |g.g.5| for our 
method and DL {Dobra and Lenkoski, 2011). The table presents the Fi-score, de¬ 
fined in (101 and MSE, defined in (111, with 50 replications and standard deviations 
in parenthesis. The Fi-score reaches its best score at 1 and its worst at 0. The MSE 
is positive value for which 0 is minimal and smaller is better. The best models for both 
Fl-score and MSE are boldfaced. 


Tablereports comparisons of the EBDMCMC method with DL, where we 
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repeat the experiments 50 times and report the average -Fi-score and MSE with 
their standard errors in parentheses. Our EBDMCMC method performs well 
overall as its Fi-score is larger and its MSE is lower than the DL method in 
most of the cases, mainly because of its fast convergence rate. As we expected, 
the DL approach converges more slowly compared to our method. 

Another way to check the performance of EBDMCMC algorithm and com¬ 
pare it with DL is based on the ROC curves as shown in Figures and 
The ROC curves depicting the performance of edge selection for each simulated 
graphs for the case p = 10 and n = 10, by varying thresholds on the pos¬ 
terior edge inclusion probabilities. As can be seen from the ROC curves the 
EBDMCMC performs better than the DL method. 


Random p=10 n=30 


Cluster p=10n=30 


Scale-free p=10n=30 





Random p=10 n=100 


Cluster p=10n=100 


0.0 0.2 0.4 0.6 0.8 

Scale-free p=10n=100 




0.0 0.2 0.4 0.6 0.8 1.0 


0.0 0.2 0.4 0.6 0.8 1.0 


Figure 4' ROC curves depicting the performances of the simulated graphs for the case 
p = 10 and n = {30,100}, based on the posterior edge inclusion probabilities. 


3 Analysis of Dupuytren disease data 


Here we analyze the data collected on patients who have Dupuytren disease in 
both hands from north of Netherlands by the Department of Plastic Surgery 
of the University Medical Center Groningen. The data are originally described 
by Lanting et al. (20131 and Lanting et al. (2014). The data consist of 279 
patients who have Dupuytren disease {n = 279); among those patients, 79 of 
them have an irreversible flexion contracture in at least on one of their fingers. 
Therefore, the data consist of a lot of zeros as shown in Figure That is, 
though the hands, are affected by the disease, the fingers didn’t show any sign 
of contraction and the total angle measure is taken as 0. 
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Random p=20 n=70 


Cluster p=20 n=70 


Scale-free p=20 n=70 






Figure 5: ROC curves depicting the performances of the simulated graphs for the case 
p = 20 and n = {70, 200}, based on the posterior edge inclusion probabilities. 


The severity of the disease in all 10 fingers of the patients is measured by 
the angles of each finger which is the sum of angles for metacarpophalangeal 
joints. To study the potential phenotype risk factors of Dupuytren disease, we 
consider potential phenotype risk 13 factors. These are smoking habits (Smok¬ 
ing), alcohol consumption (Alcohol), whether participants performed manual 
labour during a significant part of their life (Labour), whether they had sus¬ 
tained hand injury in the past including surgery (Handinjury), disease history 
information about the presence of Ledderhose, Diabetes, Epilepsy, Pey¬ 
ronie, Knuckle pad, and liver disease (LiverDiseas) and familial occurrence 
of Dupuytren disease which is defined as a first-degree relative with Dupuytren 
disease (Relative). 

For each finger we measure angles of metacarpophalangeal joints, two in- 
terphalangeal joints (for thumb fingers we only measure two interphalangeal 
joints); then we sum those angles for each fingers as a measure of the severity 
of Duptytren disease. The total angles ideally can vary from 0 to 270 degrees; 
however, in this dataset the minimum degree is 0 and maximnm 157 degrees. 
The age of participants (in years) ranges from 40 to 89 years, with an average 
age of 66 years. Smoking is binned into 3 ordered categories (never, stopped, 
and smoking). Amount of alcohol consumption is binned into 8 ordered cate¬ 
gories (ranging from no alcohol to more than 20 consumption per week). All 
other variables are binary. 

In Section [3.1[ we infer the Dupuytren disease network with 13 potential risk 
factors based on the EBDMCMC approach. In Section |3.2[ we consider only 
the severity measurements of the 10 hngers to infer the interaction among the 
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(a) Boxplot (b) Frequency histogram 


Figure 6: (a) Angles of the 10 fingers of all 279 patients, (b) Frequency histogram of 
rays affected with Dupuytren disease for all 10 fingers. 


fingers. 

3.1 Inference for Dnpuytren disease with risk factors 

We apply our Bayesian framework to infer the conditional (in) dependence among 
the 23 variables and to identify the potential risk factors of Dupuytren disease 
and discover how they affect the disease. In implementing the proposed EBDM- 
CMC approach to analyze this dataset, we place a uniform distribution as an 
uninformative prior on the graph and the G-Wishart prior Wg{5, I 23 ) on the 
precision matrix. We run the EBDMCMC algorithm for 2, OOOK iterations with 
a IjOOOK sweeps burn-in. The results are displayed in Figuresand 

The graph with the highest posterior probability is the graph with 42 edges. 
Figure visualizes 26 edges that have posterior probabilities large than 0.5. 
Similarly, Figure shows the image of all posterior inclusion probabilities where 
the degree of darkness increases with increasing posterior probabilities. 

The edges in the graph show the interactions among the 10 severity mea¬ 
surements of Duputren disease and 13 risk factors. For example, the results 
(Figures and show that factors “Age”, “Alcohol”, “Ledderhose disease”, 
“Hand Injury” and “Relative”, among those 13 risk factors, have a significant 
association with the severity of Dupuytren disease. Graph also shows that 
factor “Age” is a hub in this graph and it plays a significant role as it affects 
the severity of the disease directly and indirectly through the influence of other 
risk factors such as “Ledderhose”. 

Further we checked the stability of the selected graph with highest posterior 
probability at convergence of the algorithm with 100 different starting points. 
The resulting Figure shows the traces of number of edges in the estimated 
graphs plotted against iterations of the EBDMCMC algorithm with the 100 
different starting points. The plot shows good mixing around a stable graph 
model size which is 42 and the algorithm converges after around 300 iterations. 
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Peyronie 


Handinjury 


Right 1 



Epilepsy 


JverDisease 


Left2 


Leftl 


Figure 7/ The inferred graph for the Dupuytren disease dataset based on 13 risk factors 
and the total degrees of flexion in all 10 fingers. It reports the selected graph with 26 
edges for which their posterior inclusion probabilities (121 are more than 0.5. Sign 
“+” shows a positive relationship between nodes and shows a negative relationship. 


3.2 Severity of Dupuytren disease between pairs of fingers 


Here, we consider the relationship between incurrence of Dupuytren disease in 
pairs of fingers on both right and left hands. Interaction between fingers is 
important because it help surgeons to decide whether they should operate one 
finger or multiple fingers simultaneously. The main idea is that if fingers are 
almost independent in terms of the severity of Dupuytren disease, there is no 
reason to operate the fingers simultaneously. On the other hand, if there is a 
strong relationship between fingers, then joint surgery may be recommended if 
one of the fingers is affected. 

We apply the proposed EBDMCMC approach for the 10 variables of Dupuytren 
disease severity measures using a uniform distribution as an uninformative prior 
on the graph and the G-Wishart Iw) prior on the precision matrix. We 

run the EBDMCMC algorithm for 2, OOOK iterations with a 1, OOOK as burn-in. 
The results are displayed in Figure [T0| and Figure [TH 

The graph with the highest posterior probability is the graph with 12 edges. 
Figure [T^ visualizes the selected graph with 8 edges, for which the posterior 


inclusion probabilities in (121 are greater than 0.5. The edges in the graph show 
the interactions among the fingers with regard to the severity of Dupuytren 
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Posterior Edge Inclusion Probabilities 


Sex 

Age 

Labour 

Smoking 
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Diabetes 

Epilepsy 

LiverDisease 

Peyronie 

Ledderhose 

Knucklepads 

Handinjury 

Relative 

Righti 

Right2 
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Right4 
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Lefti 

Lefts 

Lefts 

Left4 

Lefts 



0.8 


0.6 


0.4 


0.2 
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Figure 8: Image visualization of the posterior edge inclusion probabilities of all possible 
edges in the graph, for 10 fingers with 13 risk factors. 


disease. 

The results (Figures [To|and[Tl]) show signihcant co-occurrences of Dupuytren 
disease in the ring fingers and middle fingers in both hands. This suggests that 
presence of disease in the middle finger is strongly associated to the ulnar side 
of the hand. Surprisingly, our results also show a strong relationship between 
middle fingers in both hands. Moreover, the results show that the joint interac¬ 
tions between fingers in both hands are almost symmetric. These results are in 
support of the hypotheses that the disease has genetic factors or other biological 
factors that affect similar fingers in both hands 

3.3 Fit of model to Dupuytren data 

Posterior predictive checks can be used for checking whether the proposed 
Bayesian approach fits the Dupuytren data set well or not. If the model fits the 
Dupuytren data, then simulated data generated under the model should look 
like to the observed data. In this regard, first, based on our estimated graph 
from the EBDMCMC algorithm in section |3.1[ we draw simulated data from 
the posterior predictive distribution. Then, we compare the samples to our ob¬ 
served data. Any systematic differences between the simulations and the data 
determine potential failings of the model. 

We obtain the conditional distributions of the potential risk factors and 
disease severity measures on the fingers for both simulated and observed data. 
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Trace plot for convergence check 
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Figure 9: Case study of Section 


3.1 


Trace plot of the number of edges included in 


the estimated graphs against iterations of the EBDMCMC algorithm with 100 different 
starting points. 



Figure 10: The inferred graph of Dupuytren disease dataset based on the total degrees 
of flexion in all 10 fingers. It reports the selected graph with 8 edges for which their 
posterior inclusion probabilities (121 are more than 0.5. 


The empirical and predictive conditional distributions of some selected variables 
are presented in Figures [T^ [T3| and [14| 

Figure [^displays the empirical and predictive distributions of disease sever¬ 
ity measure on finger 4 in right hand (rightd) conditional on variable “age” in 
four categories {(40, 50), (50,60), (60, 70), (70, 90)}. The variable right4, based 
on Tubiana Classification, grouped into 5 categories, category 1: 0 degree for 
total angle; 2: degree between (1,45); 3: degree between (46,90); 4: degree 
between (90,135); 5: degree more than 135. The results in Figure 12 show that 
the fit is good, since the predicted conditional distributions, in general, are the 
same with the empirical distributions. 
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Figure 11: Image visualization of the posterior edge inclusion probabilities of all pos¬ 
sible edges in the graph, for 10 fingers. 


Similarly, Figure fT^ plots the empirical and predictive distribution of disease 
severity measure on finger 5 in right hand (rights) conditional on variable “Rel¬ 
ative” and Figure \T^ plots the empirical and predictive distribution of disease 
severity measure on finger 2 in right hand (right2) conditional on variable “Led- 
derhose”. These results also suggest that the EBDMCMC algorithm fits the 
Dupuytren data well as the predicted conditional distributions are in agreement 
with the empirical distributions. 


4 Conclusion 

In this paper we have implemented a Bayesian method for discovering the effect 
of mixed potential risk factors of Dupuytren disease and the underling relation¬ 
ships between fingers on both right and left hands with regard to severity of the 
Dupuytren disease. 

The results of the case study clearly demonstrate that age, alcohol, relative, 
and ledderhose diseases all affect Dupuytren disease directly. Other risk fac¬ 
tors only affect Dupuytren disease indirectly. Another important result is that 
severity of Dupuytren disease in fingers are correlated. In particular, the middle 
finger with the ring finger. This implies that a surgical intervention on either 
the ring or middle finger should preferably be executed simultaneously. 

In Sectionj^ based on our Bayesian framework, we model the Dupuytren dis¬ 
ease by consider the 13 potential risk factors. Our result in this section support 
the hypotheses the disease has genetic factors or other biological factors that af¬ 
fect the severity of the disease in fingers simultaneously. Indeed, in genome-wide 
association study, nine genes were identified to be associated with Dupuytren 
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Figure 12: Empirical and predictive conditional distributions for total angle of finger 
4 tn right hand condition on four different categories of variable “age”. 
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Figure 13: Empirical and predictive conditional distributions for total angles of finger 
5 in right hand condition on relative variable. 


disease (Dolmans et al., 2011). It should be interesting to consider the poten¬ 
tial environmental risk factors jointly with those biological factors. Bayesian 
inference for all these factors requires a computationally efficient search algo¬ 
rithm that can potentially explore the underlying graph structure to uncover 
complicated patterns among these variables. 

We compare our EBDMCMC Bayesian approach with an alternative Bayesian 
approach (Dobra and Lenkoski, 20111 using a simulation study on various types 
of graph structures. Although, both approaches converge to the same posterior 
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Figure 14: Empirical and predictive conditional distributions for total angles of finger 
2 in right hand condition on Ledderhose disease variable. 


distribution our approach has some clear advantages on finite MCMC runs. This 
difference is mainly due to our implementation of a computationally efficient al¬ 
gorithm. Our method is computationally more efficient because of two reasons. 
Firstly, our sampling algorithm is based on birth-death process. Secondly, we 
implemented an exact way of computing for the ratio of normalizing constants 
based on the result in Uhler et al. (20141, which has been computationally a 
bottle-neck in the Bayesian approach. 

Of course, our extended Bayesian method is not limited only to this type 
of data. It can potentially be applied to any kind of mixed data where the 
observed variables are binary, ordinal or continuous. 
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